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Abstract 

Analyzing the function of gene sets is a critical step in interpreting the results of high-throughput 
experiments in systems biology. A variety of enrichment analysis tools have been developed in recent 
years, but most output a long list of significantly enriched terms that are often redundant, making it difficult 
to extract the most meaningful functions. In this paper, we present GOMA, a novel enrichment analysis 
method based on the new concept of enriched functional Gene Ontology (GO) modules. With this method, 
we systematically revealed functional GO modules, i.e., groups of functionally similar GO terms, via an 
optimization model and then ranked them by enrichment scores. Our new method simplifies enrichment 
analysis results by reducing redundancy, thereby preventing inconsistent enrichment results among 
functionally similar terms and providing more biologically meaningful results. 
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In the post-genome era, a group of genes — instead 
of single gene — is believed to perform biological 
functions, making functional analysis of gene sets an 
important way to understand cell biology 1131 . Functional 
enrichment analysis, a bioinformatics technique that has 
become popular in systems biology, is used to find 
specific and significant functions to annotate a given 
gene set. In the past decade, many enrichment analysis 
tools have been developed, and these can be classified 
into different types based on the algorithms and data 
used. A recent review categorized the 68 existing 
enrichment tools into three classes: singular enrichment 
analysis (SEA), gene set enrichment analysis (GSEA), 
and modular enrichment analysis (MEA) [41 , as shown in 
Figure 1. SEA methods calculate enrichment P value for 
each term on a pre-selected gene list, so the results 
heavily depend on how the pre-selected gene set is 
generated. GSEA methods do not require a cutoff to 
pre-selected genes and instead integrate the attributes of 
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each gene in the whole gene set into the enrichment 
score. MEA methods extend SEA methods by integrating 
the term-term/gene-gene relationships. A biological 
process is typically realized by a group of genes or gene 
products. A group of genes may also participate in 
several biological processes. Gene sets are mostly 
annotated by a set of related annotation terms, named 
term module, instead of the most significant term. From 
this point of view, MEA methods are more reasonable in 
practice. By using the term modules instead of individual 
terms, MEA methods can also solve the functional 
redundancy problem, that is, a set of identified significant 
terms represents a redundant view of the same biological 
process 151 . However, MEA methods have been only 
minimally reported in the literature (e.g. DAVID [6] and 
GenGOPi). 

In this article, we present a novel Gene Ontology 171 
(GO) enrichment analysis method. GO is the most 
important and extensively used annotation database in 
enrichment analysis. In this new method, GOMA, we 
built a GO term network; used an optimization model to 
extract GO modules, groups of densely connected and 
functionally similar terms, from the GO network; and 
ranked GO modules by their enrichment scores. GOMA 
solves the problem of functional redundancy and 
prevents inconsistent enrichment results among func- 
tionally similar terms. Compared with the classic term- 
centric hypergeometric method (CHM), computational 
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results on several real data sets show that GOMA 
provides more meaningful results. 



vector, and W = {w ij ,i,j = I,--- ,n) is the edge weight 
matrix. 



Materials and Methods 

The GOMA work flow is shown in Figure 2 and 
described in detail below. 

Construction of a GO Network 

A GO term network was constructed according to 
the global similarities among GO terms. The nodes of 
the GO network represent the GO terms annotated in the 
input gene list, and the edges denote the global 
similarities between any GO term pairs. The global 
similarity of two GO terms was calculated with 
GOSemSim 181 , which integrates GO-dependent information 
included in the GO hierarchical structure. In general, a 
GO network is presented as an undirected weighted graph 

G = (V, E), where V - {v 19 v 2 ,- • -,v„} is the node (GO 
term) set and E = {(v., v,)]v.,v.eF} is the edge set. 
There are weights a ssociated with nodes and edges. 
Suppose that F = {f l ,f 2 ,---,f } is the node weight 



Extraction of GO modules from the GO network 

The functional GO modules, that is, the densely 
connected subgraphs, were extracted from the GO 
network. There are many module identification algorithms 
for complex networks, but only a few integrate both the 
edge weights and node weights. GOMA is based on an 
optimization model to identify the most coherent 
subnetworks. The optimization model maximizes the sum 
of weights of edges and nodes within a subnetwork while 
constraining the subnetwork size. Therefore, it is suitable 
for finding the most functionally similar GO modules. The 
model can be described formally as follows: 

max X + A T J fi x i 



s.t. 



■+xt 



1 



(1-1), 



x, ^0,i =1,2, 
where max means to maximize the objective function 
and s.t. is the abbreviation for "subject to"; x, denotes the 
degree for node i belonging to the optimal GO module; A 
is a parameter to balance the node and edge weights; and 
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Figure 1. Three types of enrichment analysis tools. Enrichment analysis tools are categorized into singular enrichment analysis (SEA), gene set 
enrichment analysis (GSEA), and modular enrichment analysis (MEA) based on the algorithms and data used [M . The yellow arrows mean the 
developmental process of the enrichment analysis tools. 





© | 


GO 


©, 


GO 


® | 


Ordered 


©, 


GO 


Input 












network 




modules 




modules 




module list 



Figure 2. The GOMA work flowchart. The steps in the GOMA process are as follows: CD construct GO term network; CD extract densely con- 
nected GO modules; (3) rank GO modules by enrichment scores; and (4) output GO module list. 
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is a regularization parameter for the variable x = x 2 , 
x„). The larger /3 is, the smaller the size of the derived 
GO module. By solving the optimization model, we 
obtained a GO module by selecting the nodes with x : 
larger than a pre-determined cutoff. Then these nodes and 
the related edges were removed from the GO term 
network. This process was repeated to derive all densely 
connected GO modules. The optimization model was 
solved by the efficient iterative algorithm proposed by 
Wang ef al.w. 

To extract functionally similar GO terms that 
annotated as many genes as possible, the node weight 
was defined as j] = m/m,i=l, ■■■ ,n, where m, is the 
number of genes annotated by GO term v b and m is the 
total number of genes. The edge weight was defined 
as w t = SjjXdj, where s f is the semantic similarity of GO 
terms v ; and v s , and d :j is the norma lized average GO 

level of two terms. In detail, d V] = d tj I max ; / . {a l' t .} and 

d' iJ =((d i + d J )/2) a , where d,, i=l, 2, — , n, is the GO level 

of term «, (that is, the longest path length from the root 
to term v t in the GO hierarchical structure), and a is a 
parameter to adjust the specificity of GO modules. The 
larger a is, the more specific the derived GO module is. 
We noted that s s is a factor to find the functionally similar 
terms in GO tree structure, and d t is a factor to adjust 
the specificity of GO modules. The goal of functional 
enrichment analysis is to find the specific biological 
annotations for a gene set, and larger GO level indicates 
more specific functional annotations. Therefore, we defined 
w t as the product of s # and d t . 

Ranking of GO modules 

GO modules were ranked according to their en- 
richment scores. A GO module T is considered a pseudo 
term, and a gene is annotated by T if the gene is 
annotated by at least one term in T. To test whether a 
GO module is enriched in a given gene set, the 
enrichment score of T was defined as .sj= (n T /m)(N T /M), 
where n T is the number of genes annotated by T, m is 
the total number of genes in the given gene set, and N T 
and M are the corresponding number of genes annotated 
by T and the total number of genes in the reference set, 
respectively. The statistical significance of GO module 
enrichment was calculated by the hypergeometric test, 
which is extensively used in bioinformatics enrichment 
analysis studies. The null hypothesis is that genes 
annotated by T have the same probabilistic distributions 
in the given gene set and in the reference set. Suppose 
that the random variable % denotes the number of genes 
annotated by T in the given gene set, % should follow a 
hypergeometric distribution under the null hypothesis. 
The P value of GO module enrichment, that is, the 
probability that % would have a value greater than or 



equal to the observed value n T by chance, can be 
calculated as follows: 

(N T \(M-N T \ 
[mj 

The GO module enrichment of T in the given gene 
set is statistically significant if the P value is smaller than 
a chosen cutoff which is given by the users. When the P 
value of one GO module is lower than the cutoff, the GO 
module is considered statistically significant. 

Data sets and model parameters 

We used four data sets from the literature, originally 
analyzed with NOA™, DAVID 16 ' 111 , and GOrilla™, to evaluate 
the proposed method. In the computational experiments, 
the parameters were set as follows: {} = 10, A = 0.08, 
and a = 1. GO semantic similarities were computed 
using an R package GOSemSim 181 , which implemented 
the algorithm proposed by Wang ef a/. [13] . Yeast and 
human GO annotation data were from R packages org. 
Sc.sgd.db 1141 and org.Hs.eg.db 1151 in Bioconductor [161 , 
respectively. All GO hierarchies were generated by 
AmiGO [17] . In this study, we focused on the biological 
process (BP) domain only, though the method also can 
be applied to cellular component (CC) and molecular 
function (MF) domains. 



Results 

Transcription factor co- regulatory network 

The first data set used was the yeast cell cycle 
transcription factor (TF) co-regulatory network 1101 , which 
was downloaded from the NOA web server 1181 . The TF 
network contained 67 genes and the derived GO term 
network had 545 terms as nodes. 

GOMA results are shown in Figure 3. We compared 
the top six statistically significant GO modules found by 
GOMA with individual terms calculated by CHM and 
found that five of six GO modules contained non- 
significant GO terms by CHM method (Figure 3A). Many 
non-significant terms are functional redundancy with the 
significant ones. GOMA collected them together as GO 
modules without the functional redundancy and 
described the biological process more specific with the 
functional similar terms. We also found that the top six 
GO modules had larger average similarities compared 
with the whole GO network (Figure 3B), suggesting that 
the terms in the GO modules were more functionally 
similar than those in the whole GO network. 

Enrichment analysis results were simplified by 
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grouping functionally similar terms together. The first GO 
module mainly described how the transcription process is 
regulated (Figure 3C). (In AmiGO 1171 ontology data, GO: 
00100551, GO:0032583 and GO:0090039 were merged 
with GO:0006357, GO:0006355, and GO:0034243, 
respectively.) While GO:0006355 (regulation of trans- 
cription, DNA-dependent) and GO:0006357 (regulation of 
transcription from RNA polymerase II promoter) are 
general descriptions for the transcription regulation 
process, the other two terms have more specific 
functions: regulation of transcription elongation from RNA 



polymerase II promoter (GO:0034243) and regulation of 
ribosomal protein gene transcription from RNA 
polymerase II promoter (GO:0060962). Although the 
terms GO:0034243 and GO:0060962 were not significant 
in term-centric enrichment analysis, they were major 
contributors to the significance of the two parental terms. 
This example shows that GOMA can find a module 
containing the specific terms, whereas term-centric 
methods can only find the common ancestor of several 
specific terms. The details of other GO modules, which 
contain cell cycle-related functions, are listed in Sup- 
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Figure 3. GO modules of yeast cell cycle transcription factor co-regulatory network. A, comparison with CHM. The X axis is the rank (down) 
and P value (up) of GO modules. The Kaxis is -log m (P value) of individual terms calculated by CHM. The maximal value of the Y axis corre- 
sponds to infinity. The dashed line denotes the P value cutoff of 0.05. There are five of six GO modules containing non-significant terms which 
are functional similar. B, similarity distribution of the GO network and the average similarities of GO modules. The number in the rectangle is GO 
module rank, and the arrow points to the corresponding average similarity. GO modules have high similarities, that is, GO modules are really func- 
tional similar. C, the term structure of the first GO module. The functions of GO modules are as follows: G0:0006355, regulation of transcription, 
DNA-dependent; G0:0006357, regulation of transcription from RNA polymerase II promoter; G0:0034243, regulation of transcription elongation 
from RNA polymerase II promoter; and G0:0060962, regulation of ribosomal protein gene transcription from RNA polymerase II promoter. 
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plementary materials (http://www.cjcsysu.com/enpdf/ 
supp/12-151 .pdf). The data set used in this example is a 
cell-cycle conditional-specific transcription factor 
regulatory network 1191 , and our GO modules support the 
cell cycle process in different and specific angle, that is, 
the top six GO modules contained many cell cycle- 
related functional terms. 

DAVID demo lists 

The second and third data sets used were demo lists 



from the DAVID web server 1201 . Demo list 1 1211 contains 
164 human genes found to be up-regulated in 
CD47CD62L T cells relative to CD47CD62L + T cells. 
Demo list 2 1221 contains 403 human genes found to be 
induced in peripheral blood mononuclear cells incubated 
with purified HIV envelope proteins. After ID mapping, 
the gene sets included 155 and 379 genes while the GO 
term networks contained 2,171 and 3,439 terms, 
respectively. 

Results for demo list 1 are shown in Figure 4. CHM 
revealed that the top six GO modules got by GOMA 
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Figure 4. GO modules of DAVID demo list 1. A, comparison with CHM. The X axis is the rank (down) and P value (up) of GO modules. The Y 
axis is -log m (P value) of individual terms calculated by CHM. The maximal value of the Kaxis corresponds to infinity. The dashed line denotes the 
P value cutoff of 0.05. Many non-significant terms are functional redundancy with the significant ones. GOMA collected them together as GO 
modules without the functional redundancy. B, similarity distribution of the GO network and the average similarities of GO modules. The number in 
the rectangle is GO module rank and the arrow points to the corresponding average similarity. The high similarities for GO modules mean the 
dense subnetworks. C, the term structure of the sixth GO module shows more details about the functional relationships among the terms in one 
GO module. 
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contained non-significant terms (Figure 4A). GOMA put 
these terms together as the GO modules most 
significantly enriched in the entire gene set. We also 
found that the terms in each GO module were similar to 
each other (Figure 4B). For example, in addition to 
individually significant terms such as GO:2000106 
(regulation of leukocyte apoptotic process), GOMA 
identified GO:0033032 and GO:0043523 (regulation of 
myeloid cell and neuron apoptosis process) and the 
functionally similar terms GO:0043065 and GO:0008633 
in the sixth GO module (Figure 4C). 

Differentially expressed genes in demo list 1 could 
be classified into several functional clusters, such as 
receptor and receptor ligand genes, signal transduction 
genes, and effector genes 1211 . The receptor and receptor 
ligand cluster could be further subdivided into chemokine 
and cytokine, cell fate and apoptosis, or trophic and 
growth factor gene product pathways. The first and sixth 
GO modules covered the receptor and receptor ligand 
functions, which is consistent with the original data 
analysis 1211 . The primary function of the first GO module 
was response to stimulus (GO:0050896), and its specific 
functions included cellular response to cytokine stimulus 
(GO:0071345), cellular response to growth factor 
stimulus (GO:0071363), and chemokine-mediated 
signaling pathway (GO:0070098). The sixth GO module 
mainly described cell death and apoptosis related 
functions, such as regulation of apoptotic process (GO: 
0042981) and regulation of programmed cell death (GO: 
0043067). The second GO module covered signal 
transduction (GO:0007165) and more specific signaling 
pathway relevant terms, which corresponds to a 
significant sub gene set with the signal transduction 
function 1211 . The third and fourth GO modules were 
involved in regulation of multicellular organismal 
development (GO:2000026) and regulation of cell 
differentiation (GO:0045595), respectively. The fifth GO 
module corresponded to the effector gene cluster in 
demo list 1 , which is also an enriched function 
corresponding to a sub gene set in the original data 
analysis' 211 and included regulation of lymphocyte 
differentiation (GO:0045619) and regulation of myeloid 
cell differentiation (GO:0045638, GO:0045639). 

Results for demo list 2 are shown in Figure 5. All but 
the first GO module contained non-significant terms 
(Figure 5A). The average GO similarities within GO 
modules are shown in Figure 5B. The structure of the 
fourth GO module is illustrated as an example in Figure 
5C. Compared with the individual significant terms found 
by term-centric methods, GOMA identified the whole 
module related to cell surface receptor signaling pathway 
(GO:0007166). With GOMA, we also identified many 
specific terms that were not individually significant but 
were still related to growth factor receptor signaling 
pathways, such as GO:0048008, GO:0007173, and GO: 
0042058. Module-centric results may provide a more 



complete picture of enriched GO terms in the given gene 
set. 

Demo list 2 consists of the genes involved in the 
transcriptional changes induced by gp120 freshly isolated 
peripheral blood mononuclear cells and monocyte- 
derived-macrophages' 221 . The first GO module contained 
the regulation of biological process and related functions 
and is therefore a general function module. The second 
GO module included the specific functions of signal 
transduction (GO:0007165) that correspond to 
chemokines and modulators of chemokine-related signal 
transduction [22] . The third GO module was related to 
meiosis I (GO:0007127). The primary function of the 
fourth GO module was regulation of growth factor related 
signal pathways, including terms such as regulation of 
vascular endothelial growth factor receptor signaling 
pathway (GO:0030947), which is significantly related with 
the original data analysis' 221 , and regulation of epidermal 
growth factor receptor signaling pathway (GO:0042058). 
The fifth GO module involved regulation of receptor 
activity (GO:0010469), and the sixth module involved cell 
differentiation (GO:0030154). 

Breast cancer data 

The fourth data set used was differentially expressed 
genes in breast cancer samples' 231 originally analyzed 
with GOrilla' 121 . Genes in this list were ranked according 
to a simple f test for one group of 44 patients who 
survived longer than 5 years and another group of 33 
patients who survived less than 5 years. The top 927 
differentially expressed genes were included in the gene 
list. After deleting genes with unresolved identity and 
without GO annotations, our final list contained 775 
genes and 4,144 GO terms. 

GOMA results are shown in Figure 6. The top six 
GO modules contained both significant and non- 
significant terms (Figure 6A) and had large internal 
similarities (Figure 6B). Figure 6C shows the hierarchical 
structure of the first GO module. GOMA revealed many 
functions specific to the cellular process (GO:0009987) 
and clustered the meiosis process related GO terms into 
one functional module. 

Cell cycle phase (GO:0022403) and reproduction 
(GO:0000003) functions were enriched in the breast 
cancer data set. The first GO module grasped their 
specific offspring terms: meiosis (GO:0007126), 
pachytene (GO:000239), and other related functions. 
The second GO module involved translation-related 
functions, such as translational initiation (GO:0006413) 
and posttranscriptional regulation of gene expression 
(GO:0010608). The third GO module involved regulation 
of cell cycle process (GO:0010564) and the more 
specific regulation of meiosis and S phase (GO: 
0040020, GO:0033261), which is consistent with the first 
module. The fourth and fifth modules also included 
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Figure 5. GO modules of DAVID demo list 2. A, comparison with CHM. The X axis is the rank (down) and P value (up) of GO modules. The Y 
axis is -log w (P value) of individual terms calculated by CHM. The maximal value of the Kaxis corresponds to infinity. The dashed line denotes the 
P value cutoff of 0.05. Many non-significant terms are functional redundancy with the significant ones. GOMA collected them together as GO 
modules without functional redundancy. B, similarity distribution of the GO term network and the average similarities of GO modules. The number 
in the rectangle is GO module rank and the arrow points to the corresponding average similarity. The high similarities for GO modules mean the 
densely subnetworks. C, the term structure of the fourth GO module shows more details about the functional relationships among the terms in 
one GO module. 



functions involved in the cell cycle process. The sixth 
module included signal transduction (GO:0007165) and 
other signal pathways related functions. All terms in the 
GO modules and their functional information can be found 
in Supplementary Materials (http://www.cjcsysu.com/ 
enpdf/supp/1 2-1 51.pdf). 

Discussion 

In this paper, we proposed a new functional 



enrichment analysis method, GOMA, in which a GO 
term network is built based on term-term relationships 
and GO hierarchical structures. Functional enrichment 
results are given as a list of enriched GO modules for a 
given gene set, which reduces functional redundancy. As 
illustrated in the real data examples, GOMA can find 
meaningful and consistent functional GO modules from 
given gene sets, making it a useful tool. 

Functional redundancy poses a problem in traditional 
bioinformatics enrichment analysis. Hence, several MEA 
methods, which may overcome functional redundancy, 
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Figure 6. GO modules of breast cancer data. A, comparison with CHM. The Xaxis is the rank (down) and P value (up) of GO modules. The Y 
axis is -log m (P value) of individual terms calculated by CHM. The maximal value of the Kaxis corresponds to infinity. The dashed line denotes the 
P value cutoff of 0.05. Many non-significant terms are functional redundancy with the significant ones. GOMA collected them together as GO 
modules without the functional redundancy. B, similarity distribution of the GO term network and the average similarities of GO modules. The 
number in the rectangle is GO module rank and the arrow points to the corresponding average similarity. The high similarities for GO modules 
mean the densely subnetworks. C, the term structure of the first GO module shows more details about the functional relationships among the 
terms in one GO module. 



have been developed in recent years. By grouping terms 
as modules, MEA methods greatly simplify the results of 
enrichment analysis and improve the efficiency of 
downstream analysis. Some MEA methods, such as 
GenGO' 51 , do not explicitly model the similarity between 
terms, and as a result, the terms identified with these 
methods may not be functionally similar. With DAVID 161 , 



an MEA method, functionally similar terms are grouped 
into clusters, but the similarity of two terms is defined 
based on the number of common annotated genes in the 
given gene set. That is, with DAVID, only the local 
similarity of terms is considered: terms in the same 
cluster are similar in the given gene set but are not 
necessarily similar in the reference set. Conversely, with 
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GOMA, the global similarity of terms is considered, 
making the modules found more functionally similar and 
therefore more meaningful. Methods based on global 
similarity can also be used to find individually 
non-significant terms, which is difficult for methods 
based on local similarity. These terms are often closely 
related to significant terms and may be meaningful for 
downstream analysis. Another intuitive way to reduce the 
functional redundancy in the results of enrichment 
analysis is by clustering the significantly enriched terms 
found with term-centric enrichment analysis tools. With 
term-centric approaches, individually non-significant 
terms cannot be identified and the selected modules 
largely depend on a pre-selected cutoff. 

Currently, most MEA methods are based solely on 
term-term relationships. While gene-gene relationships 
are also considered in a few MEA methods, only limited 
information such as gene similarity is used. To our best 
knowledge, our previous method, NOA [10] , was the first 
real network-based enrichment analysis approach. With 
NOA, edge annotations are defined from gene 
annotations, and then the functional terms in edges are 
analyzed. In this way, NOA can be used to find specific 
enrichment results related to the network — results that 
cannot be obtained with traditional gene-based methods. 
However, NOA was only an initial approach in the field of 
network-based enrichment analysis, and there are many 
possible improvements. For example, the integration of 
NOA and GOMA is a promising direction for future 
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